WHAT IS CLAIMED IS: 



1. A distributed data cache as memory or register file comprising: 
a first cache memory unit having a plurality of cache ports; and 

a plurality of data buses connected with the cache memory unit, wherein each 
of the plurality of data buses is connected with the plurality of cache ports of the cache 
memory unit. 

2. The distributed data cache of Claim 1, further comprising a data path 
adapted for processing data and having at least one data input and at least one data output, 
wherein the data input and data output are connected with the plurality of data buses. 

3. The distributed data cache of Claim 2, further comprising a multiplexer 
for alternately connecting the data input with each of the plurality of data buses. 

4. The distributed data cache of Claim 2, further comprising a multiplexer 
for alternately connecting the data output with each of the plurality of data buses. 

5. The distributed data cache of Claim 1, further comprising a plurality of 
data address generators connected with a memory unit and the plurality of data buses without 
latency. 

6. The distributed data cache of Claim 5, wherein the plurality of data 
address generators are adapted to retrieve a plurality of data values from the memory unit and 
communicate the plurality of data values to the plurality of data buses directly without any 
latency due to registering. 

7. The distributed data cache of Claim 6, wherein the plurality of data 
address generators are adapted to simultaneously communicate the plurality of data values to 
the plurality of data buses, wherein each of the plurality of data values is communicated to a 
different one of the plurality of data buses. 

8. The distributed data cache of Claim 7, wherein the first cache memory 
unit is adapted to simultaneously load a plurality of data values from the plurality of data 
buses, such that each of the plurality of data values is loaded in a different one of the plurality 
of cache lines of the first cache memory unit through the same port. 

9. The distributed data cache of Claim 1, wherein the number of cache 
lines of the first cache memory unit is equal to the number of data buses. 

10. The distributed data cache of Claim 1, further comprising at least one 
additional cache memory unit also having a plurality of cache lines, wherein each cache line 
of the additional cache memory unit is connected with the plurality of data buses. 

11. The distributed data cache of Claim 10, wherein the total number of 
cache memory units is equal to the number of cache lines in each cache memory unit. 

12. An apparatus for transposing a plurality of data values arranged in a 
matrix, the apparatus comprising: 

a plurality of cache memory units, each cache memory unit having a plurality 
of cache ports; and 

a plurality of data buses, each data bus connected with a different one of the 
plurality of cache ports from each of the cache memory or register file units. 

13. The apparatus of Claim 12, further comprising a plurality of data 
address generators adapted to retrieve a plurality of data values from the memory unit and 
communicate the plurality of data values to the plurality of data buses without any latency. 

14. The apparatus of Claim 13, wherein the plurality of data values 
comprises a plurality of sets of data values. 

15. The apparatus of Claim 14, wherein the plurality of data address 
generators are adapted to sequentially communicate the plurality of sets of data values with 
the plurality of data buses without any register latency. 

16. The apparatus of Claim 15, wherein the plurality of data address 
generators are adapted to simultaneously communicate the data values of each set of data 
values to the plurality of data buses without any register latency. 

17. The apparatus of Claim 16, wherein each set of data values is a matrix row. 

18. The apparatus of Claim 16, wherein each set of data values is a matrix column. 

19. The apparatus of Claim 16, wherein each of the cache memory units is 
adapted to simultaneously load one of the sets of data values from the plurality of data buses, 
such that each data value of the set of data values is loaded in a different one of the plurality 
of cache ports of the cache memory unit or register file. 

20. A method for transposing a plurality of data values arranged in a 
matrix, the method comprising: 

retrieving a first subset of data values from the plurality of data values from a 
memory unit; 

simultaneously transferring the first subset of data values to a plurality of data 
buses, wherein each data value of the first subset is transferred to a different one of the 
plurality of data buses; and 

simultaneously loading the first subset of data values from the plurality of data 
buses to a first cache memory unit having a plurality of cache lines, wherein each cache port 
receives a data value from a different one of the plurality of data buses without any latency. 
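
For illustration only, the steps recited in Claim 20 can be modeled in software. The following is a minimal sketch and not the claimed hardware: the function name transpose_via_buses, the restriction to a square matrix, the choice of a matrix row as each subset, and the read-out step (taking the same port index from every cache memory unit) are all assumptions that do not appear in the claims. The loop over ports stands in for a transfer that the claims describe as simultaneous.

```python
def transpose_via_buses(matrix):
    """Software model of the recited steps for a square n x n matrix (assumed)."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        raise ValueError("this sketch assumes a square matrix")

    # cache_units[u][p] models port p of cache memory unit u (hypothetical layout).
    cache_units = [[None] * n for _ in range(n)]

    for u, row in enumerate(matrix):
        # Retrieve one subset (here, a row) and place each value on its own bus.
        buses = list(row)
        # Load every bus into unit u in one step: port p receives bus p.
        for p in range(n):
            cache_units[u][p] = buses[p]

    # Assumed read-out: taking port p from every unit yields column p of the input.
    return [[cache_units[u][p] for u in range(n)] for p in range(n)]
```

Under these assumptions, passing [[1, 2], [3, 4]] returns [[1, 3], [2, 4]], since each output row gathers one port position across all cache memory units.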